tensor processing unit
Why Google's custom AI chips are shaking up the tech industry
Nvidia's position as the dominant supplier of AI chips may be under threat from a specialised chip pioneered by Google, with reports suggesting companies like Meta and Anthropic are looking to spend billions on Google's tensor processing units, the latest of which is called Ironwood. The success of the artificial intelligence industry has been in large part based on graphics processing units (GPUs), a kind of computer chip that can perform many calculations in parallel, rather than one after the other like the central processing units (CPUs) that power most computers. GPUs were originally developed to assist with computer graphics, as the name suggests, and gaming. "If I have a lot of pixels in a space and I need to do a rotation of this to calculate a new camera view, this is an operation that can be done in parallel, for many different pixels," says Francesco Conti at the University of Bologna in Italy. This ability to do calculations in parallel happened to be useful for training and running AI models, which rely heavily on matrix multiplication: calculations over vast grids of numbers that can be performed at the same time.
Flex-TPU: A Flexible TPU with Runtime Reconfigurable Dataflow Architecture
Elbtity, Mohammed, Chandarana, Peyton, Zand, Ramtin
Tensor processing units (TPUs) are among the best-known machine learning (ML) accelerators, used at large scale in data centers as well as in tiny ML applications. TPUs offer several advantages over conventional ML accelerators such as graphics processing units (GPUs), because they are designed specifically to perform the multiply-accumulate (MAC) operations required by the matrix-matrix and matrix-vector multiplications that dominate the execution of deep neural networks (DNNs). These advantages include maximizing data reuse and minimizing data transfer by leveraging the temporal dataflow paradigms provided by the systolic array architecture. While this design provides a significant performance benefit, current implementations are restricted to a single dataflow: an input-, output-, or weight-stationary architecture. This can limit the achievable performance of DNN inference and reduce the utilization of compute units. Therefore, the work herein develops a reconfigurable dataflow TPU, called the Flex-TPU, which can dynamically change the dataflow per layer at run-time. Our experiments thoroughly test the viability of the Flex-TPU, comparing it to conventional TPU designs across multiple well-known ML workloads. The results show that our Flex-TPU design achieves a performance increase of up to 2.75x compared to a conventional TPU, with only minor area and power overheads.
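As a rough software analogy for the dataflows named in the abstract, the sketch below computes the same matrix product with two loop orders. In the first, each output accumulator stays in place while inputs and weights stream past it (output stationary); in the second, each weight is fetched once and reused across all input rows (weight stationary). The function names and the tiny matrices are illustrative assumptions, and this is plain sequential Python, not a model of the Flex-TPU hardware or its systolic array timing.

```python
def matmul_output_stationary(A, W):
    """Each output C[i][j] is held in one accumulator until it is complete."""
    n, k, m = len(A), len(W), len(W[0])
    C = [[0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            acc = 0  # the partial sum is the operand that stays put
            for p in range(k):
                acc += A[i][p] * W[p][j]  # one multiply-accumulate (MAC)
            C[i][j] = acc
    return C


def matmul_weight_stationary(A, W):
    """Each weight W[p][j] is read once and reused for every input row."""
    n, k, m = len(A), len(W), len(W[0])
    C = [[0] * m for _ in range(n)]
    for p in range(k):
        for j in range(m):
            w = W[p][j]  # the weight is the operand that stays put
            for i in range(n):
                C[i][j] += A[i][p] * w  # partial sums are updated in C
    return C


# Hypothetical 3x2 input and 2x3 weight matrices, chosen only for illustration.
A = [[1, 2], [3, 4], [5, 6]]
W = [[7, 8, 9], [10, 11, 12]]
C_os = matmul_output_stationary(A, W)
C_ws = matmul_weight_stationary(A, W)
```

Both orderings perform the same MACs and give the same result; what differs is which operand is reused in place. That is why the better choice can depend on a layer's shape, the motivation the abstract gives for switching dataflow per layer.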
HiP Attention: Sparse Sub-Quadratic Attention with Hierarchical Attention Pruning
Lee, Heejun, Park, Geon, Lee, Youngwan, Kim, Jina, Jeong, Wonyoung, Jeon, Myeongjae, Hwang, Sung Ju
In modern large language models (LLMs), increasing sequence lengths is a crucial challenge for enhancing their comprehension and coherence in handling complex tasks such as multi-modal question answering. However, handling long context sequences with LLMs is prohibitively costly due to the conventional attention mechanism's quadratic time and space complexity, and the context window size is limited by the GPU memory. Although recent works have proposed linear and sparse attention mechanisms to address this issue, their real-world applicability is often limited by the need to re-train pre-trained models. In response, we propose a novel approach, Hierarchically Pruned Attention (HiP), which simultaneously reduces the training and inference time complexity from $O(T^2)$ to $O(T \log T)$ and the space complexity from $O(T^2)$ to $O(T)$. To this end, we devise a dynamic sparse attention mechanism that generates an attention mask through a novel tree-search-like algorithm for a given query on the fly. HiP is training-free as it only utilizes the pre-trained attention scores to spot the positions of the top-$k$ most significant elements for each query. Moreover, it ensures that no token is overlooked, unlike the sliding window-based sub-quadratic attention methods, such as StreamingLLM. Extensive experiments on diverse real-world benchmarks demonstrate that HiP significantly reduces prompt (i.e., prefill) and decoding latency and memory usage while maintaining high generation performance with little or no degradation. As HiP allows pretrained LLMs to scale to millions of tokens on commodity GPUs with no additional engineering due to its easy plug-and-play deployment, we believe that our work will have a large practical impact, opening up the possibility to many long-context LLM applications previously infeasible.
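To make the idea of attending only to the top-$k$ positions concrete, here is a minimal sketch for a single query. It is not HiP's algorithm: it scores every key by brute force, which costs $O(T)$ per query, whereas the abstract describes a tree-search-like procedure that locates the significant positions without that full scan. The function name `topk_sparse_attention` and the toy vectors are assumptions made for illustration.

```python
import math


def topk_sparse_attention(q, keys, values, k):
    """Attend from one query to only its k highest-scoring keys."""
    scale = 1.0 / math.sqrt(len(q))
    # Brute-force scaled dot-product scores (the full scan HiP is designed to avoid).
    scores = [scale * sum(qi * ki for qi, ki in zip(q, key)) for key in keys]
    # Indices of the k largest scores; every other position is masked out.
    kept = sorted(range(len(keys)), key=lambda t: scores[t], reverse=True)[:k]
    # Numerically stable softmax restricted to the kept positions.
    peak = max(scores[t] for t in kept)
    weights = [math.exp(scores[t] - peak) for t in kept]
    total = sum(weights)
    dim = len(values[0])
    out = [
        sum(w * values[t][d] for w, t in zip(weights, kept)) / total
        for d in range(dim)
    ]
    return sorted(kept), out


# Hypothetical toy data: four tokens with 2-dimensional keys and values.
q = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0], [-1.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [5.0, 5.0]]
kept, out = topk_sparse_attention(q, keys, values, k=2)
```

With `k` equal to the number of keys, the function reduces to ordinary dense attention for that query, so the sparsity is controlled entirely by `k`.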
Google's new AI supercomputer is 'a unique approach to AI development', claims expert
Google recently announced that it has developed a unique artificial intelligence (AI) supercomputer that is faster, more efficient, and more powerful than Nvidia systems. Nvidia is the reigning champion of AI model training and deployment, dominating over 90% of the market, according to CNBC. The great AI race has been raging for a while now in Big Tech, and Google has been developing AI chips called Tensor Processing Units (TPUs) since 2016. "Google has chosen a unique approach to AI development by creating its own 'Tensor Processing Unit' (TPU) architecture, rather than relying on specialised GPUs [graphics processing units] from Nvidia," Matt Falconer, founder of Elo AI, explains. "This decision allows Google to reduce their dependence on third-party vendors and achieve vertical integration across its entire AI stack," Falconer added.
Coffee corner: are deep learning's returns diminishing?
This month, we discuss an article that appeared recently in IEEE Spectrum entitled: Deep learning's diminishing returns. The article reports that deep-learning models are becoming more and more accurate, but the computing power needed to achieve this accuracy is increasing at such a rate that, to further reduce the error rates, the cost and environmental impact is going to be unsustainably high. Joining the discussion this time are: Tom Dietterich (Oregon State University), Stephen Hanson (Rutgers University), Sabine Hauert (University of Bristol), and Sarit Kraus (Bar-Ilan University). Sarit Kraus: I would like to start by considering the research aspect. Suppose a PhD student has a great idea about how to improve some machine learning algorithm. So now, they need to show that this improved algorithm is much better than all those before.
Hardware Acceleration of Explainable Machine Learning using Tensor Processing Units
Machine learning (ML) is successful in achieving human-level performance in various fields. However, it lacks the ability to explain an outcome due to its black-box nature. While existing explainable ML is promising, almost all of these methods focus on formatting interpretability as an optimization problem. Such a mapping leads to numerous iterations of time-consuming complex computations, which limits their applicability in real-time applications. In this paper, we propose a novel framework for accelerating explainable ML using Tensor Processing Units (TPUs). The proposed framework exploits the synergy between matrix convolution and Fourier transform, and takes full advantage of TPU's natural ability in accelerating matrix computations. Specifically, this paper makes three important contributions. (1) To the best of our knowledge, our proposed work is the first attempt in enabling hardware acceleration of explainable ML using TPUs. (2) Our proposed approach is applicable across a wide variety of ML algorithms, and effective utilization of TPU-based acceleration can lead to real-time outcome interpretation. (3) Extensive experimental results demonstrate that our proposed approach can provide an order-of-magnitude speedup in both classification time (25x on average) and interpretation time (13x on average) compared to state-of-the-art techniques.
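The link between convolution and the Fourier transform that the abstract mentions rests on the convolution theorem: a circular convolution equals the inverse transform of the pointwise product of the two transforms. The sketch below checks this in one dimension with a naive discrete Fourier transform. It illustrates the identity only and does not reproduce the paper's framework or its TPU mapping; the helper names and sample signals are assumptions.

```python
import cmath


def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform (or its inverse)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [
        sum(x[t] * cmath.exp(sign * 2j * cmath.pi * f * t / n) for t in range(n))
        for f in range(n)
    ]
    return [v / n for v in out] if inverse else out


def circular_convolve(a, b):
    """Direct circular convolution of two equal-length sequences."""
    n = len(a)
    return [sum(a[j] * b[(i - j) % n] for j in range(n)) for i in range(n)]


def convolve_via_dft(a, b):
    """Same convolution via a pointwise product in the frequency domain."""
    product = [x * y for x, y in zip(dft(a), dft(b))]
    return [v.real for v in dft(product, inverse=True)]


# Hypothetical sample signals, chosen only for illustration.
a = [1.0, 2.0, 3.0, 4.0]
b = [0.5, 0.0, -1.0, 0.0]
direct = circular_convolve(a, b)
spectral = convolve_via_dft(a, b)
```

Each `dft` call here is a matrix-vector product with a fixed matrix of complex exponentials, which is the kind of operation matrix-multiplication hardware is built to speed up.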
Powerful Photon-Based Processing Units Enable Complex Artificial Intelligence
The photonic tensor core performs vector-matrix multiplications by utilizing the efficient interaction of light at different wavelengths with multistate photonic phase change memories. Researchers are using photons to create more powerful and power-efficient processing units for more complex machine learning. Machine learning performed by neural networks is a popular approach to developing artificial intelligence, as researchers aim to replicate brain functionalities for a variety of applications. A paper in the journal Applied Physics Reviews, by AIP Publishing, proposes a new approach to performing the computations required by a neural network, using light instead of electricity. In this approach, a photonic tensor core performs multiplications of matrices in parallel, improving the speed and efficiency of current deep learning paradigms.
Google claims its new TPUs are 2.7 times faster than the previous generation
Google's fourth-generation tensor processing units (TPUs), the existence of which wasn't publicly revealed until today, can complete AI and machine learning training workloads in close-to-record wall clock time. That's according to the latest set of metrics released by MLPerf, the consortium of over 70 companies and academic institutions behind the MLPerf suite for AI performance benchmarking. It shows clusters of fourth-gen TPUs surpassing the capabilities of third-generation TPUs, and even those of Nvidia's recently released A100, on object detection, image classification, natural language processing, machine translation, and recommendation benchmarks. Google says its fourth-generation TPU offers more than double the matrix multiplication TFLOPs of a third-generation TPU, where a single TFLOP is equivalent to 1 trillion floating-point operations per second. It also offers a "significant" boost in memory bandwidth while benefiting from unspecified advances in interconnect technology.
Tensor Processing Unit (TPU) technical paper.
A Tensor Processing Unit (TPU) is an accelerator application-specific integrated circuit (ASIC) developed by Google for artificial intelligence and neural network machine learning. As machine learning gains relevance and importance every day, conventional microprocessors are known to be unable to handle the computations effectively, be it training or neural network processing. The 1st generation TPU is a hardware chip used in Google's data centers for faster computation. The 2nd generation TPU is now available in the cloud and lets businesses everywhere access this accelerator technology to speed up their machine learning workloads using its high-speed network. The 3rd generation TPU is twice as powerful as the previous generation, which results in an 8-fold increase in performance.